Buying and selling used phones and tablets used to happen on only a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade: a recent IDC (International Data Corporation) forecast predicts that the used phone market will be worth \$52.7bn by 2023, with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used phones and tablets, which offer considerable savings compared with new models.
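As a quick sanity check on these figures (a rough back-of-the-envelope sketch; only the \$52.7bn value and the 13.6% CAGR come from the IDC forecast, and the 2018 base is back-calculated here):

```python
# Back out the implied 2018 market size from the forecast:
# value_2023 = value_2018 * (1 + CAGR) ** 5
value_2023 = 52.7  # $bn, IDC forecast for 2023
cagr = 0.136       # 13.6% CAGR over 2018-2023
value_2018 = value_2023 / (1 + cagr) ** 5
print(f"Implied 2018 market size: ${value_2018:.1f}bn")  # about $27.9bn
```

In other words, the forecast implies the market nearly doubles over those five years.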
Refurbished and used devices continue to provide cost-effective alternatives for both consumers and businesses looking to save money on a purchase, and the market has plenty of other benefits. Used and refurbished devices can be sold with warranties and insured with proof of purchase. Third-party vendors and platforms, such as Verizon and Amazon, provide attractive offers on refurbished devices. Maximizing device longevity through second-hand trade also reduces environmental impact by promoting recycling and reducing waste. The COVID-19 outbreak may further boost this segment as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.
The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.
The data contains the different attributes of used/refurbished phones and tablets. The data was collected in the year 2021. The detailed data dictionary is given below.
# this will help in making the Python code more structured automatically (good coding practice)
#%load_ext nb_black
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# split the data into train and test
from sklearn.model_selection import train_test_split
# to build a linear regression model
from sklearn.linear_model import LinearRegression
# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# to build a linear regression model using statsmodels
import statsmodels.api as sm
# to compute VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
# load the data
from google.colab import files
import io
try:
    uploaded
except NameError:
    uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded["used_device_data.csv"]))
Saving used_device_data.csv to used_device_data.csv
# return the first five rows
df.head()
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 146.0 | 2020 | 127 | 4.307572 | 4.715100 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 213.0 | 2020 | 325 | 5.162097 | 5.519018 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 213.0 | 2020 | 162 | 5.111084 | 5.884631 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 480.0 | 2020 | 345 | 5.135387 | 5.630961 |
| 4 | Honor | Android | 15.32 | yes | no | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 185.0 | 2020 | 293 | 4.389995 | 4.947837 |
# return the last five rows
df.tail()
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3449 | Asus | Android | 15.34 | yes | no | NaN | 8.0 | 64.0 | 6.0 | 5000.0 | 190.0 | 2019 | 232 | 4.492337 | 6.483872 |
| 3450 | Asus | Android | 15.24 | yes | no | 13.0 | 8.0 | 128.0 | 8.0 | 4000.0 | 200.0 | 2018 | 541 | 5.037732 | 6.251538 |
| 3451 | Alcatel | Android | 15.80 | yes | no | 13.0 | 5.0 | 32.0 | 3.0 | 4000.0 | 165.0 | 2020 | 201 | 4.357350 | 4.528829 |
| 3452 | Alcatel | Android | 15.80 | yes | no | 13.0 | 5.0 | 32.0 | 2.0 | 4000.0 | 160.0 | 2020 | 149 | 4.349762 | 4.624188 |
| 3453 | Alcatel | Android | 12.83 | yes | no | 13.0 | 5.0 | 16.0 | 2.0 | 4000.0 | 168.0 | 2020 | 176 | 4.132122 | 4.279994 |
# return the number of rows and columns
df.shape
(3454, 15)
Observations: There are 3454 rows and 15 columns in the data.
# print a concise summary of the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3454 entries, 0 to 3453
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   brand_name             3454 non-null   object
 1   os                     3454 non-null   object
 2   screen_size            3454 non-null   float64
 3   4g                     3454 non-null   object
 4   5g                     3454 non-null   object
 5   main_camera_mp         3275 non-null   float64
 6   selfie_camera_mp       3452 non-null   float64
 7   int_memory             3450 non-null   float64
 8   ram                    3450 non-null   float64
 9   battery                3448 non-null   float64
 10  weight                 3447 non-null   float64
 11  release_year           3454 non-null   int64
 12  days_used              3454 non-null   int64
 13  normalized_used_price  3454 non-null   float64
 14  normalized_new_price   3454 non-null   float64
dtypes: float64(9), int64(2), object(4)
memory usage: 404.9+ KB
Observations:
Most of the columns in the data are numeric in nature ('int64' or 'float64' type).
# drop "brand_name" and "os" columns
#columns_to_drop = ["brand_name", "os"]
#df = df.drop(columns_to_drop, axis=1)
# check missing values in each column
df.isnull().sum()
brand_name                 0
os                         0
screen_size                0
4g                         0
5g                         0
main_camera_mp           179
selfie_camera_mp           2
int_memory                 4
ram                        4
battery                    6
weight                     7
release_year               0
days_used                  0
normalized_used_price      0
normalized_new_price       0
dtype: int64
Observations: 'main_camera_mp' has the most missing values (179), while 'selfie_camera_mp', 'int_memory', 'ram', 'battery', and 'weight' have only a handful each. These will need to be imputed before modeling.
# check the statistical summary of all columns (categorical and numerical)
df.describe(include='all').T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| brand_name | 3454 | 34 | Others | 502 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| os | 3454 | 4 | Android | 3214 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| screen_size | 3454.0 | NaN | NaN | NaN | 13.713115 | 3.80528 | 5.08 | 12.7 | 12.83 | 15.34 | 30.71 |
| 4g | 3454 | 2 | yes | 2335 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5g | 3454 | 2 | no | 3302 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| main_camera_mp | 3275.0 | NaN | NaN | NaN | 9.460208 | 4.815461 | 0.08 | 5.0 | 8.0 | 13.0 | 48.0 |
| selfie_camera_mp | 3452.0 | NaN | NaN | NaN | 6.554229 | 6.970372 | 0.0 | 2.0 | 5.0 | 8.0 | 32.0 |
| int_memory | 3450.0 | NaN | NaN | NaN | 54.573099 | 84.972371 | 0.01 | 16.0 | 32.0 | 64.0 | 1024.0 |
| ram | 3450.0 | NaN | NaN | NaN | 4.036122 | 1.365105 | 0.02 | 4.0 | 4.0 | 4.0 | 12.0 |
| battery | 3448.0 | NaN | NaN | NaN | 3133.402697 | 1299.682844 | 500.0 | 2100.0 | 3000.0 | 4000.0 | 9720.0 |
| weight | 3447.0 | NaN | NaN | NaN | 182.751871 | 88.413228 | 69.0 | 142.0 | 160.0 | 185.0 | 855.0 |
| release_year | 3454.0 | NaN | NaN | NaN | 2015.965258 | 2.298455 | 2013.0 | 2014.0 | 2015.5 | 2018.0 | 2020.0 |
| days_used | 3454.0 | NaN | NaN | NaN | 674.869716 | 248.580166 | 91.0 | 533.5 | 690.5 | 868.75 | 1094.0 |
| normalized_used_price | 3454.0 | NaN | NaN | NaN | 4.364712 | 0.588914 | 1.536867 | 4.033931 | 4.405133 | 4.7557 | 6.619433 |
| normalized_new_price | 3454.0 | NaN | NaN | NaN | 5.233107 | 0.683637 | 2.901422 | 4.790342 | 5.245892 | 5.673718 | 7.847841 |
Observations: The '4g' and '5g' columns each have two unique values, 'brand_name' has 34 unique values, and 'os' (the operating system the device runs) has 4 unique values.
# check statistical summary of the numerical data
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| screen_size | 3454.0 | 13.713115 | 3.805280 | 5.080000 | 12.700000 | 12.830000 | 15.340000 | 30.710000 |
| main_camera_mp | 3275.0 | 9.460208 | 4.815461 | 0.080000 | 5.000000 | 8.000000 | 13.000000 | 48.000000 |
| selfie_camera_mp | 3452.0 | 6.554229 | 6.970372 | 0.000000 | 2.000000 | 5.000000 | 8.000000 | 32.000000 |
| int_memory | 3450.0 | 54.573099 | 84.972371 | 0.010000 | 16.000000 | 32.000000 | 64.000000 | 1024.000000 |
| ram | 3450.0 | 4.036122 | 1.365105 | 0.020000 | 4.000000 | 4.000000 | 4.000000 | 12.000000 |
| battery | 3448.0 | 3133.402697 | 1299.682844 | 500.000000 | 2100.000000 | 3000.000000 | 4000.000000 | 9720.000000 |
| weight | 3447.0 | 182.751871 | 88.413228 | 69.000000 | 142.000000 | 160.000000 | 185.000000 | 855.000000 |
| release_year | 3454.0 | 2015.965258 | 2.298455 | 2013.000000 | 2014.000000 | 2015.500000 | 2018.000000 | 2020.000000 |
| days_used | 3454.0 | 674.869716 | 248.580166 | 91.000000 | 533.500000 | 690.500000 | 868.750000 | 1094.000000 |
| normalized_used_price | 3454.0 | 4.364712 | 0.588914 | 1.536867 | 4.033931 | 4.405133 | 4.755700 | 6.619433 |
| normalized_new_price | 3454.0 | 5.233107 | 0.683637 | 2.901422 | 4.790342 | 5.245892 | 5.673718 | 7.847841 |
Observations: 'int_memory' (max 1024 GB vs. a 75th percentile of 64 GB) and 'weight' (max 855 g) show values far above their upper quartiles, suggesting outliers. At least half the devices have 4 GB of RAM, and the average device has been used for about 675 days.
# check for duplicates
duplicates = df.duplicated()
# print the duplicated rows
print(df[duplicates])
Empty DataFrame
Columns: [brand_name, os, screen_size, 4g, 5g, main_camera_mp, selfie_camera_mp, int_memory, ram, battery, weight, release_year, days_used, normalized_used_price, normalized_new_price]
Index: []
Observation: There are no duplicate rows in the data frame.
# Let's write a function to plot a boxplot and a histogram on the same scale.
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a triangle indicates the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# Let's write a function to create labeled barplots.
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-coordinate: center of the bar
        y = p.get_height()  # y-coordinate: top of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()  # show the plot
Let's check the distributions of the columns with numerical data.
Observations on 'normalized_used_price'
histogram_boxplot(df, 'normalized_used_price', figsize=(15, 10), kde=False, bins=None)
The distribution is approximately normal, with outliers on both tails.
Observations on 'normalized_new_price'
histogram_boxplot(df, 'normalized_new_price', figsize=(15, 10), kde=False, bins=None)
The distribution is approximately normal, with outliers on both tails.
screen_size
histogram_boxplot(df, 'screen_size', figsize=(15, 10), kde=False, bins=None)
main_camera_mp
histogram_boxplot(df, 'main_camera_mp', figsize=(15, 10), kde=False, bins=None)
selfie_camera_mp
histogram_boxplot(df, 'selfie_camera_mp', figsize=(15, 10), kde=False, bins=None)
int_memory
histogram_boxplot(df, 'int_memory', figsize=(15, 10), kde=False, bins=None)
ram
histogram_boxplot(df, 'ram', figsize=(15, 10), kde=False, bins=None)
weight
histogram_boxplot(df, 'weight', figsize=(15, 10), kde=False, bins=None)
battery
histogram_boxplot(df, 'battery', figsize=(15, 10), kde=False, bins=None)
days_used
histogram_boxplot(df, 'days_used', figsize=(15, 10), kde=False, bins=None)
Now, let's check the distributions of the remaining columns.
brand_name
labeled_barplot(df, 'brand_name', perc=False, n=None)
os
labeled_barplot(df, 'os', perc=False, n=None)
4g
labeled_barplot(df, '4g', perc=False, n=None)
5g
labeled_barplot(df, '5g', perc=False, n=None)
release_year
labeled_barplot(df, 'release_year', perc=False, n=None)
# pairwise relationships between the columns
cData_attr = df.iloc[:, 0:15]
sns.pairplot(cData_attr, diag_kind="kde")  # density curves instead of histograms on the diagonal
<seaborn.axisgrid.PairGrid at 0x7a49bd112ce0>
# exclude release_year from the numeric columns used for the correlation heatmap
c_list = df.select_dtypes(include=np.number).columns.tolist()
c_list.remove("release_year")
plt.figure(figsize=(15, 7))
sns.heatmap(
df[c_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
Questions:
The amount of RAM is important for the smooth functioning of a device. How does the amount of RAM vary with the brand?
plt.figure(figsize=(15, 5))
sns.boxplot(data=df, x="brand_name", y="ram")
plt.xticks(rotation=90)
plt.show()
A large battery often increases a device's weight, making it feel uncomfortable in the hands. How does the weight vary for phones and tablets offering large batteries (more than 4500 mAh)?
# create a new data frame for devices with battery capacity greater than 4500 mAh
df_large_battery = df[df.battery > 4500]
df_large_battery.shape
(341, 15)
plt.figure(figsize=(15, 5))
sns.boxplot(data=df_large_battery, x="brand_name", y="weight")
plt.xticks(rotation=90)
plt.show()
Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones and tablets are available across different brands with a screen size larger than 6 inches?
# create a new data frame for devices with a screen size larger than 6 inches
# (screen_size is in cm, so convert: 1 inch = 2.54 cm)
df_large_screen = df[df.screen_size > 6 * 2.54]
df_large_screen.shape
(1099, 15)
labeled_barplot(df_large_screen,'brand_name', perc=False, n=None)
A lot of devices nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of devices offering greater than 8MP selfie cameras across brands?
# create data frame for devices offering greater than 8MP selfie cameras
df_selfie_camera = df[df.selfie_camera_mp > 8]
df_selfie_camera.shape
(655, 15)
labeled_barplot(df_selfie_camera,'brand_name', perc=False, n=None)
Which attributes are highly correlated with the normalized price of a used device?
numeric_df = df.select_dtypes(include=np.number) # Select only numeric columns
normalized_price_corr = numeric_df.corr()["normalized_used_price"].sort_values(ascending=False)
highly_correlated_cols = normalized_price_corr[abs(normalized_price_corr) > 0.5].index.tolist()
print("Attributes highly correlated with normalized used price:")
print(highly_correlated_cols)
Attributes highly correlated with normalized used price:
['normalized_used_price', 'normalized_new_price', 'screen_size', 'battery', 'selfie_camera_mp', 'main_camera_mp', 'ram', 'release_year']
Let's look at the relationship between the categorical variables and the normalized price of used devices.
# create a bar plot showing the average normalized used price for each brand
plt.figure(figsize=(12, 6))
sns.barplot(x='brand_name', y='normalized_used_price', data=df)
plt.xticks(rotation=90) # Rotate x-axis labels for better readability if needed
plt.title('Average Normalized Used Price by Brand')
plt.xlabel('Brand')
plt.ylabel('Average Normalized Used Price')
plt.show()
# visualize how the prices of used devices vary across release years
plt.figure(figsize=(10, 6))
sns.boxplot(x='release_year', y='normalized_used_price', data=df)
plt.xticks(rotation=45)
plt.title('Price Variation Across Years')
plt.xlabel('Release Year')
plt.ylabel('Normalized Used Price')
plt.show()
# visualize how prices of used devices differ between phones and tablets offering 4G and 5G networks
plt.figure(figsize=(10, 6))
network_types = ['4g', '5g']
# reshape (melt) the data so that the '4g' and '5g' columns can be compared side by side
melted_df = pd.melt(df, value_vars=network_types, id_vars='normalized_used_price', var_name='Network', value_name='Network_Type')
sns.boxplot(x='Network_Type', y='normalized_used_price', hue='Network', data=melted_df)
plt.title('Price Comparison for 4G and 5G Devices')
plt.xlabel('Network Type')
plt.ylabel('Normalized Used Price')
plt.legend(title='Network')
plt.show()
# create a copy of the data
df1 = df.copy()
# check missing values in each column
df1.isnull().sum()
brand_name                 0
os                         0
screen_size                0
4g                         0
5g                         0
main_camera_mp           179
selfie_camera_mp           2
int_memory                 4
ram                        4
battery                    6
weight                     7
release_year               0
days_used                  0
normalized_used_price      0
normalized_new_price       0
dtype: int64
# impute missing values in cols_impute with median by grouping the data on release year and brand name
cols_impute = [
"main_camera_mp",
"selfie_camera_mp",
"int_memory",
"ram",
"battery",
"weight",
]
for col in cols_impute:
    df1[col] = df1[col].fillna(
        value=df1.groupby(["release_year", "brand_name"])[col].transform("median")
    )
# checking for missing values after imputation
missing_values = df1.isnull().sum()
print(missing_values)
brand_name                 0
os                         0
screen_size                0
4g                         0
5g                         0
main_camera_mp           179
selfie_camera_mp           2
int_memory                 0
ram                        0
battery                    6
weight                     7
release_year               0
days_used                  0
normalized_used_price      0
normalized_new_price       0
dtype: int64
# impute the remaining missing values in the data by the column medians grouped by brand_name
cols_impute = [
"main_camera_mp",
"selfie_camera_mp",
"battery",
"weight",
]
for col in cols_impute:
    df1[col] = df1[col].fillna(
        value=df1.groupby("brand_name")[col].transform("median")
    )
# checking for missing values after imputation
missing_values_after = df1[cols_impute].isnull().sum()
print(missing_values_after)
main_camera_mp      10
selfie_camera_mp     0
battery              0
weight               0
dtype: int64
# Impute missing values in 'main_camera_mp' column with median
df1["main_camera_mp"] = df1["main_camera_mp"].fillna(df1["main_camera_mp"].median())
# Check for missing values after imputation
missing_values_after = df1["main_camera_mp"].isnull().sum()
print("Missing values in 'main_camera_mp' after imputation:", missing_values_after)
Missing values in 'main_camera_mp' after imputation: 0
Let's create a new column, 'years_since_release', from the 'release_year' column (the data was collected in 2021).
# create a new column 'years_since_release' from the 'release_year' column
df1["years_since_release"] = 2021 - df1["release_year"]
df1.drop("release_year", axis=1, inplace=True)
df1["years_since_release"].describe()
count    3454.000000
mean        5.034742
std         2.298455
min         1.000000
25%         3.000000
50%         5.500000
75%         7.000000
max         8.000000
Name: years_since_release, dtype: float64
Let's check for outliers using boxplots.
# outlier detection using boxplots
num_cols = df1.select_dtypes(include=np.number).columns.tolist()
num_cols_count = len(num_cols)
plt.figure(figsize=(15, 10))
rows = int(np.ceil(num_cols_count / 3))  # three boxplots per row
for i, variable in enumerate(num_cols):
    plt.subplot(rows, 3, i + 1)
    sns.boxplot(data=df1, x=variable)
    plt.title(f"Boxplot of {variable}")
plt.tight_layout(pad=2)
plt.show()
Let's explore the data once again after these manipulations.
df1.head()
| | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | days_used | normalized_used_price | normalized_new_price | years_since_release |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 14.50 | yes | no | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 146.0 | 127 | 4.307572 | 4.715100 | 1 |
| 1 | Honor | Android | 17.30 | yes | yes | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 213.0 | 325 | 5.162097 | 5.519018 | 1 |
| 2 | Honor | Android | 16.69 | yes | yes | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 213.0 | 162 | 5.111084 | 5.884631 | 1 |
| 3 | Honor | Android | 25.50 | yes | yes | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 480.0 | 345 | 5.135387 | 5.630961 | 1 |
| 4 | Honor | Android | 15.32 | yes | no | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 185.0 | 293 | 4.389995 | 4.947837 | 1 |
df1.shape
(3454, 15)
df1.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| screen_size | 3454.0 | 13.713115 | 3.805280 | 5.080000 | 12.700000 | 12.830000 | 15.340000 | 30.710000 |
| main_camera_mp | 3454.0 | 9.617597 | 4.749438 | 0.080000 | 5.000000 | 8.000000 | 13.000000 | 48.000000 |
| selfie_camera_mp | 3454.0 | 6.555067 | 6.968440 | 0.000000 | 2.000000 | 5.000000 | 8.000000 | 32.000000 |
| int_memory | 3454.0 | 54.528474 | 84.934991 | 0.010000 | 16.000000 | 32.000000 | 64.000000 | 1024.000000 |
| ram | 3454.0 | 4.032620 | 1.368598 | 0.020000 | 4.000000 | 4.000000 | 4.000000 | 12.000000 |
| battery | 3454.0 | 3132.577446 | 1298.884193 | 500.000000 | 2100.000000 | 3000.000000 | 4000.000000 | 9720.000000 |
| weight | 3454.0 | 182.636856 | 88.360445 | 69.000000 | 142.000000 | 160.000000 | 185.000000 | 855.000000 |
| days_used | 3454.0 | 674.869716 | 248.580166 | 91.000000 | 533.500000 | 690.500000 | 868.750000 | 1094.000000 |
| normalized_used_price | 3454.0 | 4.364712 | 0.588914 | 1.536867 | 4.033931 | 4.405133 | 4.755700 | 6.619433 |
| normalized_new_price | 3454.0 | 5.233107 | 0.683637 | 2.901422 | 4.790342 | 5.245892 | 5.673718 | 7.847841 |
| years_since_release | 3454.0 | 5.034742 | 2.298455 | 1.000000 | 3.000000 | 5.500000 | 7.000000 | 8.000000 |
df1.isnull().sum()
brand_name               0
os                       0
screen_size              0
4g                       0
5g                       0
main_camera_mp           0
selfie_camera_mp         0
int_memory               0
ram                      0
battery                  0
weight                   0
days_used                0
normalized_used_price    0
normalized_new_price     0
years_since_release      0
dtype: int64
Now we don't have any missing values in our data frame.
# Assuming 'normalized_used_price' is the column representing the target variable and other columns are features in DataFrame df1
# Define the independent variables (features) X and the dependent variable (target) y
X = df1.drop('normalized_used_price', axis=1) # Exclude 'normalized_used_price' column from features
y = df1['normalized_used_price'] # Select 'normalized_used_price' as the target variable
print(X.head())
print()
print(y.head())
  brand_name       os  screen_size   4g   5g  main_camera_mp  \
0      Honor  Android        14.50  yes   no            13.0
1      Honor  Android        17.30  yes  yes            13.0
2      Honor  Android        16.69  yes  yes            13.0
3      Honor  Android        25.50  yes  yes            13.0
4      Honor  Android        15.32  yes   no            13.0

   selfie_camera_mp  int_memory  ram  battery  weight  days_used  \
0               5.0        64.0  3.0   3020.0   146.0        127
1              16.0       128.0  8.0   4300.0   213.0        325
2               8.0       128.0  8.0   4200.0   213.0        162
3               8.0        64.0  6.0   7250.0   480.0        345
4               8.0        64.0  3.0   5000.0   185.0        293

   normalized_new_price  years_since_release
0              4.715100                    1
1              5.519018                    1
2              5.884631                    1
3              5.630961                    1
4              4.947837                    1

0    4.307572
1    5.162097
2    5.111084
3    5.135387
4    4.389995
Name: normalized_used_price, dtype: float64
# let's add the intercept to data
X = sm.add_constant(X)
# encode the categorical columns in X as dummy variables
X = pd.get_dummies(
X,
columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True,
)
X.head()
| | const | screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | days_used | normalized_new_price | ... | brand_name_Spice | brand_name_Vivo | brand_name_XOLO | brand_name_Xiaomi | brand_name_ZTE | os_Others | os_Windows | os_iOS | 4g_yes | 5g_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 14.50 | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 146.0 | 127 | 4.715100 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1.0 | 17.30 | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 213.0 | 325 | 5.519018 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2 | 1.0 | 16.69 | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 213.0 | 162 | 5.884631 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 1.0 | 25.50 | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 480.0 | 345 | 5.630961 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | 1.0 | 15.32 | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 185.0 | 293 | 4.947837 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 49 columns
# Split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 'test_size=0.3' specifies a 70:30 split (70% train, 30% test)
# 'random_state=42' sets a seed for reproducibility
# Print the shapes of train and test sets (optional)
print("Train set shapes - X_train:", x_train.shape, "y_train:", y_train.shape)
print("Test set shapes - X_test:", x_test.shape, "y_test:", y_test.shape)
Train set shapes - X_train: (2417, 49) y_train: (2417,)
Test set shapes - X_test: (1037, 49) y_test: (1037,)
print(x_train.head())
const screen_size main_camera_mp selfie_camera_mp int_memory ram \
1744 1.0 10.34 13.0 1.3 32.0 4.0
3141 1.0 10.29 8.0 2.0 32.0 4.0
1233 1.0 12.83 13.0 5.0 32.0 4.0
3046 1.0 10.24 5.0 0.3 16.0 4.0
2649 1.0 11.81 8.0 1.9 32.0 4.0
battery weight days_used normalized_new_price ... brand_name_Spice \
1744 2440.0 121.0 632 5.477969 ... 0
3141 1800.0 166.0 828 4.593604 ... 0
1233 2600.0 165.0 620 5.523179 ... 0
3046 2000.0 150.0 739 4.939640 ... 0
2649 2100.0 134.0 785 4.702751 ... 0
brand_name_Vivo brand_name_XOLO brand_name_Xiaomi brand_name_ZTE \
1744 0 0 0 0
3141 0 0 0 1
1233 0 0 0 0
3046 0 1 0 0
2649 0 0 0 0
os_Others os_Windows os_iOS 4g_yes 5g_yes
1744 0 0 0 1 0
3141 0 0 0 0 0
1233 0 0 0 0 0
3046 0 0 0 0 0
2649 0 0 0 1 0
[5 rows x 49 columns]
print(x_test.head())
const screen_size main_camera_mp selfie_camera_mp int_memory ram \
511 1.0 17.78 2.0 0.3 16.0 4.0
51 1.0 20.42 13.0 8.0 64.0 4.0
1171 1.0 12.70 13.0 5.0 32.0 4.0
2282 1.0 17.78 0.3 0.3 16.0 4.0
557 1.0 10.24 5.0 0.3 16.0 4.0
battery weight days_used normalized_new_price ... brand_name_Spice \
511 2820.0 260.0 976 4.377140 ... 0
51 6100.0 320.0 462 5.556287 ... 0
1171 2300.0 149.8 948 5.019199 ... 0
2282 3000.0 326.0 626 4.865532 ... 0
557 1700.0 96.0 831 4.866996 ... 0
brand_name_Vivo brand_name_XOLO brand_name_Xiaomi brand_name_ZTE \
511 0 0 0 0
51 0 0 0 0
1171 0 0 0 0
2282 0 0 0 0
557 0 0 0 0
os_Others os_Windows os_iOS 4g_yes 5g_yes
511 0 0 0 0 0
51 0 0 0 1 0
1171 0 0 0 1 0
2282 0 0 0 0 0
557 0 0 0 0 0
[5 rows x 49 columns]
x_train.isnull().sum()
const                    0
screen_size              0
main_camera_mp           0
selfie_camera_mp         0
int_memory               0
ram                      0
battery                  0
weight                   0
days_used                0
normalized_new_price     0
years_since_release      0
brand_name_Alcatel       0
brand_name_Apple         0
brand_name_Asus          0
brand_name_BlackBerry    0
brand_name_Celkon        0
brand_name_Coolpad       0
brand_name_Gionee        0
brand_name_Google        0
brand_name_HTC           0
brand_name_Honor         0
brand_name_Huawei        0
brand_name_Infinix       0
brand_name_Karbonn       0
brand_name_LG            0
brand_name_Lava          0
brand_name_Lenovo        0
brand_name_Meizu         0
brand_name_Micromax      0
brand_name_Microsoft     0
brand_name_Motorola      0
brand_name_Nokia         0
brand_name_OnePlus       0
brand_name_Oppo          0
brand_name_Others        0
brand_name_Panasonic     0
brand_name_Realme        0
brand_name_Samsung       0
brand_name_Sony          0
brand_name_Spice         0
brand_name_Vivo          0
brand_name_XOLO          0
brand_name_Xiaomi        0
brand_name_ZTE           0
os_Others                0
os_Windows               0
os_iOS                   0
4g_yes                   0
5g_yes                   0
dtype: int64
# x_train and y_train contain the training data
# Fit the OLS model
olsmodel = sm.OLS(y_train, x_train).fit()
# Print the summary of the OLS model
print(olsmodel.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.849
Model: OLS Adj. R-squared: 0.846
Method: Least Squares F-statistic: 277.1
Date: Sun, 19 Nov 2023 Prob (F-statistic): 0.00
Time: 02:46:45 Log-Likelihood: 125.15
No. Observations: 2417 AIC: -152.3
Df Residuals: 2368 BIC: 131.4
Df Model: 48
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const 1.4172 0.072 19.677 0.000 1.276 1.558
screen_size 0.0295 0.004 8.352 0.000 0.023 0.036
main_camera_mp 0.0232 0.002 14.892 0.000 0.020 0.026
selfie_camera_mp 0.0116 0.001 9.924 0.000 0.009 0.014
int_memory 0.0002 6.76e-05 2.779 0.005 5.53e-05 0.000
ram 0.0305 0.005 5.797 0.000 0.020 0.041
battery -1.665e-05 7.35e-06 -2.266 0.024 -3.11e-05 -2.24e-06
weight 0.0008 0.000 5.946 0.000 0.001 0.001
days_used 3.376e-05 3.07e-05 1.101 0.271 -2.64e-05 9.39e-05
normalized_new_price 0.4104 0.012 33.494 0.000 0.386 0.434
years_since_release -0.0255 0.005 -5.589 0.000 -0.034 -0.017
brand_name_Alcatel -0.0804 0.050 -1.618 0.106 -0.178 0.017
brand_name_Apple -0.0438 0.148 -0.297 0.767 -0.333 0.246
brand_name_Asus 0.0068 0.049 0.138 0.890 -0.090 0.103
brand_name_BlackBerry 0.0312 0.072 0.434 0.664 -0.110 0.172
brand_name_Celkon -0.2372 0.068 -3.485 0.001 -0.371 -0.104
brand_name_Coolpad -0.0308 0.071 -0.434 0.664 -0.170 0.108
brand_name_Gionee -0.0130 0.059 -0.221 0.825 -0.128 0.102
brand_name_Google -0.1192 0.083 -1.442 0.149 -0.281 0.043
brand_name_HTC -0.0403 0.050 -0.805 0.421 -0.138 0.058
brand_name_Honor -0.0478 0.051 -0.941 0.347 -0.147 0.052
brand_name_Huawei -0.0599 0.046 -1.299 0.194 -0.150 0.030
brand_name_Infinix 0.1338 0.113 1.179 0.238 -0.089 0.356
brand_name_Karbonn -0.0592 0.068 -0.867 0.386 -0.193 0.075
brand_name_LG -0.0608 0.047 -1.299 0.194 -0.152 0.031
brand_name_Lava -0.0230 0.063 -0.364 0.716 -0.147 0.101
brand_name_Lenovo -0.0364 0.047 -0.771 0.441 -0.129 0.056
brand_name_Meizu -0.0860 0.056 -1.530 0.126 -0.196 0.024
brand_name_Micromax -0.0645 0.049 -1.315 0.189 -0.161 0.032
brand_name_Microsoft 0.0749 0.082 0.916 0.360 -0.085 0.235
brand_name_Motorola -0.0686 0.051 -1.349 0.177 -0.168 0.031
brand_name_Nokia 0.0380 0.052 0.724 0.469 -0.065 0.141
brand_name_OnePlus -0.0384 0.073 -0.523 0.601 -0.182 0.105
brand_name_Oppo -0.0294 0.049 -0.599 0.549 -0.126 0.067
brand_name_Others -0.0679 0.044 -1.556 0.120 -0.154 0.018
brand_name_Panasonic -0.0427 0.062 -0.690 0.490 -0.164 0.078
brand_name_Realme -0.0359 0.063 -0.568 0.570 -0.160 0.088
brand_name_Samsung -0.0617 0.045 -1.376 0.169 -0.150 0.026
brand_name_Sony -0.0756 0.053 -1.428 0.153 -0.179 0.028
brand_name_Spice -0.0356 0.068 -0.520 0.603 -0.170 0.099
brand_name_Vivo -0.0644 0.050 -1.277 0.202 -0.163 0.034
brand_name_XOLO -0.0781 0.057 -1.362 0.173 -0.191 0.034
brand_name_Xiaomi 0.0325 0.050 0.655 0.513 -0.065 0.130
brand_name_ZTE -0.0460 0.048 -0.952 0.341 -0.141 0.049
os_Others -0.0604 0.033 -1.856 0.064 -0.124 0.003
os_Windows -0.0374 0.043 -0.861 0.389 -0.122 0.048
os_iOS -0.0141 0.148 -0.095 0.924 -0.304 0.276
4g_yes 0.0406 0.016 2.514 0.012 0.009 0.072
5g_yes -0.0916 0.032 -2.846 0.004 -0.155 -0.029
==============================================================================
Omnibus: 234.465 Durbin-Watson: 1.994
Prob(Omnibus): 0.000 Jarque-Bera (JB): 630.705
Skew: -0.536 Prob(JB): 1.11e-137
Kurtosis: 5.262 Cond. No. 1.85e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.85e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Interpretation of OLS Regression Results:
R-squared (R²) and Adjusted R-squared: An R-squared value of 0.849 suggests that approximately 84.9% of the variance in the normalized used price is explained by the model; the adjusted R-squared (0.846) corrects this figure for the number of predictors.
Coefficients (coef): Each coefficient represents the estimated effect of an independent variable on the dependent variable, holding other variables constant. For instance: Positive coefficients (e.g., 'screen_size', 'main_camera_mp', 'selfie_camera_mp') indicate a positive relationship with the normalized price. Negative coefficients (e.g., 'years_since_release', '5g_yes') indicate a negative relationship with the normalized price.
P-values (P>|t|): These values assess the statistical significance of each coefficient. A smaller p-value (usually < 0.05) suggests that the coefficient is statistically significant. Variables with higher p-values may be considered less significant.
Prob (F-statistic): This is the probability associated with the overall F-statistic. A low probability (< 0.05) indicates that at least one independent variable has a significant effect on the dependent variable.
Omnibus, Durbin-Watson, Jarque-Bera, Skew, Kurtosis: These statistics assess the model's assumptions and goodness of fit. For instance, skewness and kurtosis measure the departure from normality.
Condition Number: Indicates the presence of multicollinearity. A high condition number (>100 or >1000) suggests strong multicollinearity among the independent variables.
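To make the summary-table columns concrete, here is a minimal, self-contained sketch (on synthetic data, not the ReCell dataset) that reproduces the coef, std err, t, P>|t|, and confidence-interval columns by hand with NumPy and SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + 2 predictors
beta_true = np.array([1.0, 0.5, 0.0])  # second predictor truly has no effect
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# OLS estimates: beta_hat = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
dof = n - X.shape[1]                               # residual degrees of freedom
sigma2 = resid @ resid / dof                       # residual variance
se = np.sqrt(np.diag(sigma2 * XtX_inv))            # "std err" column
t_vals = beta_hat / se                             # "t" column
p_vals = 2 * stats.t.sf(np.abs(t_vals), dof)       # "P>|t|" column (two-sided)
ci_lo = beta_hat - stats.t.ppf(0.975, dof) * se    # "[0.025" column
ci_hi = beta_hat + stats.t.ppf(0.975, dof) * se    # "0.975]" column

for name, b, s, t, p in zip(["const", "x1", "x2"], beta_hat, se, t_vals, p_vals):
    print(f"{name:6s} coef={b:7.4f} se={s:6.4f} t={t:8.3f} p={p:.3f}")
```

These are exactly the quantities statsmodels reports in the tables above; x1 (true coefficient 0.5) comes out highly significant, while the noise predictor x2 typically does not.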
Let's check the model's performance using different metrics.
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
r2 = r2_score(targets, predictions)
n = predictors.shape[0]
k = predictors.shape[1]
return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
# function to compute MAPE
def mape_score(targets, predictions):
return np.mean(np.abs(targets - predictions) / targets) * 100
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
"""
Function to compute different metrics to check regression model performance
model: regressor
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
r2 = r2_score(target, pred) # to compute R-squared
adjr2 = adj_r2_score(predictors, target, pred) # to compute adjusted R-squared
rmse = np.sqrt(mean_squared_error(target, pred)) # to compute RMSE
mae = mean_absolute_error(target, pred) # to compute MAE
mape = mape_score(target, pred) # to compute MAPE
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"RMSE": rmse,
"MAE": mae,
"R-squared": r2,
"Adj. R-squared": adjr2,
"MAPE": mape,
},
index=[0],
)
return df_perf
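As a quick sanity check on the adjusted R-squared formula used in `adj_r2_score`, the toy example below (synthetic data, NumPy only) fits OLS with and without extra pure-noise features: plain R-squared can only go up when columns are added, while the adjusted version applies a (n - 1) / (n - k - 1) penalty for each extra predictor:

```python
import numpy as np

def r2_and_adj(X, y):
    """R^2 and adjusted R^2 for an OLS fit of y on X (X includes the intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    n, k = X.shape[0], X.shape[1] - 1  # k excludes the intercept
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

rng = np.random.default_rng(7)
n = 100
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)
X_base = np.column_stack([np.ones(n), x])
X_noisy = np.column_stack([X_base, rng.normal(size=(n, 10))])  # 10 junk features

r2_b, adj_b = r2_and_adj(X_base, y)
r2_n, adj_n = r2_and_adj(X_noisy, y)
print(f"base : R2={r2_b:.4f} adjR2={adj_b:.4f}")
print(f"noisy: R2={r2_n:.4f} adjR2={adj_n:.4f}")
```

Because of the penalty, adjusted R-squared is the fairer metric when comparing models with different numbers of predictors, which is why it is reported alongside R-squared throughout this analysis.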
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmodel_train_perf = model_performance_regression(olsmodel, x_train, y_train)
olsmodel_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.229761 | 0.178533 | 0.848887 | 0.845758 | 4.293664 |
# checking model performance on test set (unseen 30% data)
print("Test Performance\n")
olsmodel_test_perf = model_performance_regression(olsmodel, x_test, y_test)
olsmodel_test_perf
Test Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.239062 | 0.188692 | 0.832176 | 0.823844 | 4.513288 |
Almost all of the brand_name dummy variables have p-values greater than 0.05 (only brand_name_Celkon is significant at the 5% level). Let's drop the brand_name dummy columns and recheck the model's performance.
# List all columns related to brand_name dummy variables
brand_cols = [col for col in x_train.columns if col.startswith('brand_name_')]
# Drop all brand_name related dummy variables from x_train
x_train2 = x_train.drop(brand_cols, axis=1)
# List all columns related to brand_name dummy variables in the test dataset
brand_cols_test = [col for col in x_test.columns if col.startswith('brand_name_')]
# Drop all brand_name related dummy variables from x_test
x_test2 = x_test.drop(brand_cols_test, axis=1)
# Fit the model with the modified x_train
olsmodel2 = sm.OLS(y_train, x_train2)
olsres_1 = olsmodel2.fit()
print(
"R-squared:",
np.round(olsres_1.rsquared, 3),
"\nAdjusted R-squared:",
np.round(olsres_1.rsquared_adj, 3),
)
R-squared: 0.845
Adjusted R-squared: 0.844
After dropping the brand_name dummy variables, adjusted R-squared decreased by only 0.002 (from 0.846 to 0.844), so the brand dummies added little predictive power.
import statsmodels.api as sm
# x_train2 and y_train contain the training data
# Fit the OLS model
olsmodel2 = sm.OLS(y_train, x_train2).fit()
# Print the summary of the OLS model
print(olsmodel2.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.845
Model: OLS Adj. R-squared: 0.844
Method: Least Squares F-statistic: 874.0
Date: Sun, 19 Nov 2023 Prob (F-statistic): 0.00
Time: 02:46:45 Log-Likelihood: 96.114
No. Observations: 2417 AIC: -160.2
Df Residuals: 2401 BIC: -67.58
Df Model: 15
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 1.3640 0.052 26.029 0.000 1.261 1.467
screen_size 0.0290 0.003 8.385 0.000 0.022 0.036
main_camera_mp 0.0228 0.001 15.658 0.000 0.020 0.026
selfie_camera_mp 0.0120 0.001 10.699 0.000 0.010 0.014
int_memory 0.0001 6.44e-05 1.843 0.065 -7.6e-06 0.000
ram 0.0290 0.005 5.692 0.000 0.019 0.039
battery -1.317e-05 7.22e-06 -1.824 0.068 -2.73e-05 9.9e-07
weight 0.0008 0.000 5.943 0.000 0.001 0.001
days_used 4.383e-05 3.03e-05 1.449 0.148 -1.55e-05 0.000
normalized_new_price 0.4136 0.011 36.883 0.000 0.392 0.436
years_since_release -0.0280 0.004 -6.326 0.000 -0.037 -0.019
os_Others -0.0456 0.030 -1.539 0.124 -0.104 0.013
os_Windows 0.0349 0.033 1.045 0.296 -0.031 0.100
os_iOS -0.0103 0.049 -0.211 0.833 -0.105 0.085
4g_yes 0.0438 0.015 2.827 0.005 0.013 0.074
5g_yes -0.0830 0.032 -2.631 0.009 -0.145 -0.021
==============================================================================
Omnibus: 238.775 Durbin-Watson: 2.000
Prob(Omnibus): 0.000 Jarque-Bera (JB): 668.702
Skew: -0.533 Prob(JB): 6.21e-146
Kurtosis: 5.346 Cond. No. 4.08e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.08e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
In the new OLS regression results, the dummy variables for the OS the device runs on (the os_ columns) all have p-values greater than 0.05. Let's drop these columns and see how R-squared and adjusted R-squared change.
# List all columns related to OS types (os_ prefixes)
os_cols = [col for col in x_train2.columns if col.startswith('os_')]
# Drop all OS-related columns from x_train
x_train3 = x_train2.drop(os_cols, axis=1)
# List all columns related to OS types (os_ prefixes)
os_cols = [col for col in x_test2.columns if col.startswith('os_')]
# Drop all OS-related columns from x_test
x_test3 = x_test2.drop(os_cols, axis=1)
# Fit the model with the modified x_train
olsmodel3 = sm.OLS(y_train, x_train3)
olsres_2 = olsmodel3.fit()
print(
"R-squared:",
np.round(olsres_2.rsquared, 3),
"\nAdjusted R-squared:",
np.round(olsres_2.rsquared_adj, 3),
)
R-squared: 0.845
Adjusted R-squared: 0.844
Both R-squared and adj. R-squared values stayed the same after dropping columns related to OS.
import statsmodels.api as sm
# x_train3 and y_train contain the training data
# Fit the OLS model
olsmodel3 = sm.OLS(y_train, x_train3).fit()
# Print the summary of the OLS model
print(olsmodel3.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.845
Model: OLS Adj. R-squared: 0.844
Method: Least Squares F-statistic: 1092.
Date: Sun, 19 Nov 2023 Prob (F-statistic): 0.00
Time: 02:46:45 Log-Likelihood: 94.290
No. Observations: 2417 AIC: -162.6
Df Residuals: 2404 BIC: -87.31
Df Model: 12
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 1.3287 0.046 28.589 0.000 1.238 1.420
screen_size 0.0312 0.003 9.939 0.000 0.025 0.037
main_camera_mp 0.0232 0.001 16.013 0.000 0.020 0.026
selfie_camera_mp 0.0119 0.001 10.640 0.000 0.010 0.014
int_memory 0.0001 6.44e-05 1.816 0.069 -9.31e-06 0.000
ram 0.0310 0.005 6.246 0.000 0.021 0.041
battery -1.452e-05 7.18e-06 -2.023 0.043 -2.86e-05 -4.45e-07
weight 0.0007 0.000 5.777 0.000 0.000 0.001
days_used 4.661e-05 3.02e-05 1.544 0.123 -1.26e-05 0.000
normalized_new_price 0.4142 0.011 37.628 0.000 0.393 0.436
years_since_release -0.0273 0.004 -6.197 0.000 -0.036 -0.019
4g_yes 0.0442 0.015 2.858 0.004 0.014 0.075
5g_yes -0.0884 0.031 -2.823 0.005 -0.150 -0.027
==============================================================================
Omnibus: 240.010 Durbin-Watson: 1.997
Prob(Omnibus): 0.000 Jarque-Bera (JB): 636.392
Skew: -0.552 Prob(JB): 6.45e-139
Kurtosis: 5.258 Cond. No. 3.42e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.42e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
We will check the following linear regression assumptions:
1. No multicollinearity
2. Linearity of variables
3. Independence of error terms
4. Normality of error terms
5. No heteroscedasticity
Let's test for multicollinearity using the VIF (variance inflation factor).
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Calculate VIF for predictors in the training set (X_train3)
vif_series_train = pd.Series(
[variance_inflation_factor(x_train3.values, i) for i in range(x_train3.shape[1])],
index=x_train3.columns,
)
print("VIF values for training set:\n")
print(vif_series_train)
VIF values for training set:

const                   95.880541
screen_size              6.462515
main_camera_mp           2.055485
selfie_camera_mp         2.565127
int_memory               1.224326
ram                      2.008373
battery                  3.836712
weight                   5.554191
days_used                2.472959
normalized_new_price     2.570473
years_since_release      4.505587
4g_yes                   2.310097
5g_yes                   1.695259
dtype: float64
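For reference, the VIF of predictor j equals 1 / (1 - R_j²), where R_j² comes from regressing predictor j on all the other predictors. The NumPy-only sketch below illustrates this on synthetic data (not the ReCell features); the large VIF on `const` above is expected for the intercept term and is conventionally ignored, while predictor VIFs above 5 (or 10) are the usual rule-of-thumb flags:

```python
import numpy as np

def vif_manual(X, j):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all other columns (plus an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + rng.normal(scale=0.3, size=500)  # strongly collinear with x1
x3 = rng.normal(size=500)                         # independent predictor
X = np.column_stack([x1, x2, x3])
print([round(vif_manual(X, j), 2) for j in range(3)])  # x1 and x2 inflated, x3 near 1
```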
def treating_multicollinearity(predictors, target, high_vif_columns):
"""
Checking the effect of dropping the columns showing high multicollinearity
on model performance (adj. R-squared and RMSE)
predictors: independent variables
target: dependent variable
high_vif_columns: columns having high VIF
"""
# empty lists to store adj. R-squared and RMSE values
adj_r2 = []
rmse = []
# build ols models by dropping one of the high VIF columns at a time
# store the adjusted R-squared and RMSE in the lists defined previously
for cols in high_vif_columns:
# defining the new train set
train = predictors.loc[:, ~predictors.columns.str.startswith(cols)]
# create the model
olsmod = sm.OLS(target, train).fit()
# adding adj. R-squared and RMSE to the lists
adj_r2.append(olsmod.rsquared_adj)
rmse.append(np.sqrt(olsmod.mse_resid))
# creating a dataframe for the results
temp = pd.DataFrame(
{
"col": high_vif_columns,
"Adj. R-squared after_dropping col": adj_r2,
"RMSE after dropping col": rmse,
}
).sort_values(by="Adj. R-squared after_dropping col", ascending=False)
temp.reset_index(drop=True, inplace=True)
return temp
col_list = ['screen_size']  # columns with high VIF to evaluate for dropping
# Check the effect on model performance after dropping specified columns from the training data
res = treating_multicollinearity(x_train3, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | screen_size | 0.837871 | 0.238037 |
col_to_drop = 'screen_size'
# Drop the specified column from the train and test data
x_train4 = x_train3.loc[:, ~x_train3.columns.str.startswith(col_to_drop)]
x_test4 = x_test3.loc[:, ~x_test3.columns.str.startswith(col_to_drop)]
# Check VIF after dropping the specified column
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calculate_vif(X):
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
return vif_data
vif_result = calculate_vif(x_train4)
print(vif_result)
                 Feature        VIF
0                  const  88.136335
1         main_camera_mp   2.044983
2       selfie_camera_mp   2.562639
3             int_memory   1.220129
4                    ram   1.970646
5                battery   3.555564
6                 weight   2.737859
7              days_used   2.467377
8   normalized_new_price   2.512565
9    years_since_release   4.389092
10                4g_yes   2.309453
11                5g_yes   1.678568
# x_train4 and y_train contain the training data
# Fit the OLS model
olsmodel4 = sm.OLS(y_train, x_train4).fit()
# Print the summary of the OLS model
print(olsmodel4.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.839
Model: OLS Adj. R-squared: 0.838
Method: Least Squares F-statistic: 1136.
Date: Sun, 19 Nov 2023 Prob (F-statistic): 0.00
Time: 02:46:46 Log-Likelihood: 45.628
No. Observations: 2417 AIC: -67.26
Df Residuals: 2405 BIC: 2.228
Df Model: 11
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 1.4599 0.045 32.118 0.000 1.371 1.549
main_camera_mp 0.0242 0.001 16.435 0.000 0.021 0.027
selfie_camera_mp 0.0122 0.001 10.739 0.000 0.010 0.014
int_memory 7.954e-05 6.56e-05 1.212 0.226 -4.91e-05 0.000
ram 0.0377 0.005 7.529 0.000 0.028 0.048
battery 4.79e-06 7.05e-06 0.680 0.497 -9.03e-06 1.86e-05
weight 0.0016 8.94e-05 17.948 0.000 0.001 0.002
days_used 3.235e-05 3.08e-05 1.052 0.293 -2.8e-05 9.27e-05
normalized_new_price 0.4306 0.011 38.787 0.000 0.409 0.452
years_since_release -0.0344 0.004 -7.742 0.000 -0.043 -0.026
4g_yes 0.0416 0.016 2.639 0.008 0.011 0.073
5g_yes -0.1193 0.032 -3.752 0.000 -0.182 -0.057
==============================================================================
Omnibus: 250.325 Durbin-Watson: 1.992
Prob(Omnibus): 0.000 Jarque-Bera (JB): 557.189
Skew: -0.628 Prob(JB): 1.02e-121
Kurtosis: 4.988 Cond. No. 3.30e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.3e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
# Drop the 'battery' column from both train and test data
x_train5 = x_train4.drop('battery', axis=1)
x_test5 = x_test4.drop('battery', axis=1)
# x_train5 and y_train contain the training data
# Fit the OLS model
olsmodel5 = sm.OLS(y_train, x_train5).fit()
# Print the summary of the OLS model
print(olsmodel5.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.839
Model: OLS Adj. R-squared: 0.838
Method: Least Squares F-statistic: 1250.
Date: Sun, 19 Nov 2023 Prob (F-statistic): 0.00
Time: 02:46:46 Log-Likelihood: 45.396
No. Observations: 2417 AIC: -68.79
Df Residuals: 2406 BIC: -5.099
Df Model: 10
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 1.4640 0.045 32.496 0.000 1.376 1.552
main_camera_mp 0.0243 0.001 16.708 0.000 0.021 0.027
selfie_camera_mp 0.0122 0.001 10.777 0.000 0.010 0.014
int_memory 8e-05 6.56e-05 1.220 0.223 -4.86e-05 0.000
ram 0.0379 0.005 7.577 0.000 0.028 0.048
weight 0.0017 6.01e-05 27.463 0.000 0.002 0.002
days_used 3.141e-05 3.07e-05 1.023 0.307 -2.88e-05 9.17e-05
normalized_new_price 0.4313 0.011 38.987 0.000 0.410 0.453
years_since_release -0.0351 0.004 -8.103 0.000 -0.044 -0.027
4g_yes 0.0433 0.016 2.773 0.006 0.013 0.074
5g_yes -0.1201 0.032 -3.779 0.000 -0.182 -0.058
==============================================================================
Omnibus: 251.496 Durbin-Watson: 1.992
Prob(Omnibus): 0.000 Jarque-Bera (JB): 559.807
Skew: -0.631 Prob(JB): 2.75e-122
Kurtosis: 4.992 Cond. No. 7.05e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.05e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
# Drop the 'int_memory' column from both train and test data
x_train6 = x_train5.drop('int_memory', axis=1)
x_test6 = x_test5.drop('int_memory', axis=1)
# x_train6 and y_train contain the training data
# Fit the OLS model
olsmodel6 = sm.OLS(y_train, x_train6).fit()
# Print the summary of the OLS model
print(olsmodel6.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.838
Model: OLS Adj. R-squared: 0.838
Method: Least Squares F-statistic: 1388.
Date: Sun, 19 Nov 2023 Prob (F-statistic): 0.00
Time: 02:46:46 Log-Likelihood: 44.649
No. Observations: 2417 AIC: -69.30
Df Residuals: 2407 BIC: -11.40
Df Model: 9
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 1.4639 0.045 32.489 0.000 1.376 1.552
main_camera_mp 0.0241 0.001 16.673 0.000 0.021 0.027
selfie_camera_mp 0.0125 0.001 11.144 0.000 0.010 0.015
ram 0.0369 0.005 7.477 0.000 0.027 0.047
weight 0.0016 5.99e-05 27.448 0.000 0.002 0.002
days_used 2.993e-05 3.07e-05 0.975 0.330 -3.03e-05 9.01e-05
normalized_new_price 0.4338 0.011 39.952 0.000 0.413 0.455
years_since_release -0.0354 0.004 -8.182 0.000 -0.044 -0.027
4g_yes 0.0413 0.016 2.659 0.008 0.011 0.072
5g_yes -0.1139 0.031 -3.631 0.000 -0.175 -0.052
==============================================================================
Omnibus: 248.554 Durbin-Watson: 1.992
Prob(Omnibus): 0.000 Jarque-Bera (JB): 588.821
Skew: -0.605 Prob(JB): 1.38e-128
Kurtosis: 5.093 Cond. No. 7.04e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.04e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
col_to_drop = 'days_used'
# Drop the specified column from the train and test data
x_train7 = x_train6.loc[:, ~x_train6.columns.str.startswith(col_to_drop)]
x_test7 = x_test6.loc[:, ~x_test6.columns.str.startswith(col_to_drop)]
# Check VIF after dropping the specified column
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calculate_vif(X):
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
return vif_data
vif_result = calculate_vif(x_train7)
print(vif_result)
                Feature        VIF
0                 const  84.310306
1        main_camera_mp   1.948906
2      selfie_camera_mp   2.475466
3                   ram   1.911049
4                weight   1.228056
5  normalized_new_price   2.400419
6   years_since_release   2.638205
7                4g_yes   2.192522
8                5g_yes   1.621826
Let's check the R-squared and adjusted R-squared values and the OLS regression results after dropping the 'days_used' column.
# Fit the model with the modified x_train7
olsmodel7 = sm.OLS(y_train, x_train7)
olsres_6 = olsmodel7.fit()
print(
"R-squared:",
np.round(olsres_6.rsquared, 3),
"\nAdjusted R-squared:",
np.round(olsres_6.rsquared_adj, 3),
)
R-squared: 0.838
Adjusted R-squared: 0.838
# x_train7 and y_train contain the training data
# Fit the OLS model
olsmodel7 = sm.OLS(y_train, x_train7).fit()
# Print the summary of the OLS model
print(olsmodel7.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.838
Model: OLS Adj. R-squared: 0.838
Method: Least Squares F-statistic: 1562.
Date: Sun, 19 Nov 2023 Prob (F-statistic): 0.00
Time: 02:46:46 Log-Likelihood: 44.172
No. Observations: 2417 AIC: -70.34
Df Residuals: 2408 BIC: -18.23
Df Model: 8
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 1.4710 0.044 33.088 0.000 1.384 1.558
main_camera_mp 0.0243 0.001 16.921 0.000 0.021 0.027
selfie_camera_mp 0.0124 0.001 11.105 0.000 0.010 0.015
ram 0.0368 0.005 7.460 0.000 0.027 0.046
weight 0.0016 5.99e-05 27.481 0.000 0.002 0.002
normalized_new_price 0.4334 0.011 39.943 0.000 0.412 0.455
years_since_release -0.0328 0.003 -9.528 0.000 -0.040 -0.026
4g_yes 0.0433 0.015 2.814 0.005 0.013 0.073
5g_yes -0.1166 0.031 -3.731 0.000 -0.178 -0.055
==============================================================================
Omnibus: 251.097 Durbin-Watson: 1.994
Prob(Omnibus): 0.000 Jarque-Bera (JB): 597.486
Skew: -0.610 Prob(JB): 1.81e-130
Kurtosis: 5.109 Cond. No. 1.92e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.92e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
'ram', 'normalized_new_price', 'main_camera_mp', 'selfie_camera_mp', and 'weight' all have p-values well below 0.05, so they are statistically significant predictors.
Additionally, the 'years_since_release' variable has a negative coefficient, indicating an inverse relationship with the target.
After dropping the features causing strong multicollinearity and the statistically insignificant ones, model performance has barely changed (adj. R-squared dropped from 0.846 to 0.838). This shows that the dropped variables carried little unique predictive power.
print(x_train7.shape[1]) # This will print the number of columns (features) in x_train7
9
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmodel2_train_perf = model_performance_regression(olsmodel7, x_train7, y_train)
olsmodel2_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.237589 | 0.184111 | 0.838414 | 0.83781 | 4.455122 |
# checking model performance on test set (unseen 30% data)
print("Test Performance\n")
olsmodel2_test_perf = model_performance_regression(olsmodel7, x_test7, y_test)
olsmodel2_test_perf
Test Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.239989 | 0.188992 | 0.830872 | 0.82939 | 4.534151 |
Additionally, the R-squared and adjusted R-squared values are relatively high, suggesting that a significant proportion of the variance in the target variable is explained by the model.
Consistent performance between training and test data indicates that the model has not overfitted or underfitted significantly. This balance is essential for the model's generalization to new, unseen data.
# let us create a dataframe with actual, fitted and residual values
# (from olsmodel7, the final reduced model fitted on x_train7)
df_pred = pd.DataFrame()
df_pred["Actual Values"] = y_train  # actual values
df_pred["Fitted Values"] = olsmodel7.fittedvalues  # predicted values
df_pred["Residuals"] = olsmodel7.resid  # residuals
df_pred.head()
| Actual Values | Fitted Values | Residuals | |
|---|---|---|---|
| 1744 | 4.261975 | 4.299760 | -0.037785 |
| 3141 | 4.175156 | 3.863044 | 0.312112 |
| 1233 | 4.117410 | 4.422986 | -0.305576 |
| 3046 | 3.782597 | 3.838826 | -0.056228 |
| 2649 | 3.981922 | 3.908022 | 0.073900 |
# let's plot the fitted values vs residuals
sns.residplot(
data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
sns.histplot(data=df_pred, x='Residuals', kde=True)
plt.title("Normality of residuals")
plt.show()
import scipy.stats as stats
# Q-Q plot of the residuals against the normal distribution
stats.probplot(df_pred["Residuals"], dist="norm", plot=plt)
plt.title("Q-Q plot of residuals")
plt.show()
from scipy.stats import shapiro
# Assuming 'residuals' contains the residuals data
stat, p_value = shapiro(df_pred['Residuals'])
print("Shapiro-Wilk Test:")
print(f"Test Statistic: {stat}")
print(f"P-value: {p_value}")
Shapiro-Wilk Test:
Test Statistic: 0.9635675549507141
P-value: 3.1806631095935104e-24
The p-value is far below 0.05, so the Shapiro-Wilk test rejects normality of the residuals. With ~2,400 observations, however, the test flags even minor departures from normality, so the histogram and Q-Q plot are the more practical checks here.
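To illustrate how to read the Shapiro-Wilk output, the sketch below runs the test on a clearly normal and a clearly skewed synthetic sample (the seed and sample sizes are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
samples = {
    "normal": rng.normal(size=500),       # should NOT be rejected
    "skewed": rng.exponential(size=500),  # should be rejected
}

results = {}
for name, sample in samples.items():
    stat, p = shapiro(sample)  # W statistic and p-value
    results[name] = p
    verdict = "reject normality" if p < 0.05 else "fail to reject normality"
    print(f"{name:6s} W={stat:.4f} p={p:.2e} -> {verdict}")
```

The null hypothesis is that the sample is normally distributed, so a small p-value (as for our residuals) is evidence against normality.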
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(df_pred["Residuals"], x_train7)
lzip(name, test)
[('F statistic', 0.9465723563939281), ('p-value', 0.8290872249127756)]
The p-value (0.829) is much greater than 0.05, so we fail to reject the null hypothesis of homoscedasticity; the residuals can be treated as having constant variance.
olsmodel_final = sm.OLS(y_train, x_train7).fit()
print(olsmodel_final.summary())
OLS Regression Results
=================================================================================
Dep. Variable: normalized_used_price R-squared: 0.838
Model: OLS Adj. R-squared: 0.838
Method: Least Squares F-statistic: 1562.
Date: Sun, 19 Nov 2023 Prob (F-statistic): 0.00
Time: 02:50:22 Log-Likelihood: 44.172
No. Observations: 2417 AIC: -70.34
Df Residuals: 2408 BIC: -18.23
Df Model: 8
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 1.4710 0.044 33.088 0.000 1.384 1.558
main_camera_mp 0.0243 0.001 16.921 0.000 0.021 0.027
selfie_camera_mp 0.0124 0.001 11.105 0.000 0.010 0.015
ram 0.0368 0.005 7.460 0.000 0.027 0.046
weight 0.0016 5.99e-05 27.481 0.000 0.002 0.002
normalized_new_price 0.4334 0.011 39.943 0.000 0.412 0.455
years_since_release -0.0328 0.003 -9.528 0.000 -0.040 -0.026
4g_yes 0.0433 0.015 2.814 0.005 0.013 0.073
5g_yes -0.1166 0.031 -3.731 0.000 -0.178 -0.055
==============================================================================
Omnibus: 251.097 Durbin-Watson: 1.994
Prob(Omnibus): 0.000 Jarque-Bera (JB): 597.486
Skew: -0.610 Prob(JB): 1.81e-130
Kurtosis: 5.109 Cond. No. 1.92e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.92e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
# checking model performance on test set (unseen 30% data)
print("Test Performance\n")
olsmodel_final_test_perf = model_performance_regression(olsmodel7, x_test7, y_test)
olsmodel_final_test_perf
Test Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.239989 | 0.188992 | 0.830872 | 0.82939 | 4.534151 |
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmodel_final_train_perf = model_performance_regression(olsmodel7, x_train7, y_train)
olsmodel_final_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.237589 | 0.184111 | 0.838414 | 0.83781 | 4.455122 |
Let's print the linear regression equation.
# let's check the model parameters
olsres_6.params
const                   1.471000
main_camera_mp          0.024310
selfie_camera_mp        0.012399
ram                     0.036821
weight                  0.001646
normalized_new_price    0.433439
years_since_release    -0.032820
4g_yes                  0.043276
5g_yes                 -0.116627
dtype: float64
# Let us write the equation of linear regression
# (using .iloc for positional access, since plain [] positional
# indexing on a Series is deprecated in newer pandas versions)
Equation = "Normalized Used Price ="
print(Equation, end=" ")
for i in range(len(x_train7.columns)):
    if i == 0:
        # intercept term
        print(olsres_6.params.iloc[i], "+", end=" ")
    elif i != len(x_train7.columns) - 1:
        print(olsres_6.params.iloc[i], "* (", x_train7.columns[i], ")", "+", end=" ")
    else:
        print(olsres_6.params.iloc[i], "* (", x_train7.columns[i], ")")
Normalized Used Price = 1.4709995513070717 + 0.02430970340877544 * ( main_camera_mp ) + 0.012398536935614223 * ( selfie_camera_mp ) + 0.03682117135984637 * ( ram ) + 0.0016462076251945706 * ( weight ) + 0.4334394322552675 * ( normalized_new_price ) + -0.03282027252875066 * ( years_since_release ) + 0.04327551997499909 * ( 4g_yes ) + -0.1166271600845224 * ( 5g_yes )
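As a usage sketch, the equation above can be applied to score a single device. The coefficients below are copied from the final model's parameters; the example device's feature values are hypothetical, and the result is on the normalized price scale used throughout this analysis, not a currency amount:

```python
# Coefficients copied from the final model's params
coefs = {
    "const": 1.471000,
    "main_camera_mp": 0.024310,
    "selfie_camera_mp": 0.012399,
    "ram": 0.036821,
    "weight": 0.001646,
    "normalized_new_price": 0.433439,
    "years_since_release": -0.032820,
    "4g_yes": 0.043276,
    "5g_yes": -0.116627,
}
device = {  # hypothetical used phone (feature values are made up)
    "const": 1.0,
    "main_camera_mp": 13.0,
    "selfie_camera_mp": 8.0,
    "ram": 4.0,
    "weight": 160.0,
    "normalized_new_price": 5.2,
    "years_since_release": 3.0,
    "4g_yes": 1.0,
    "5g_yes": 0.0,
}
pred = sum(coefs[k] * device[k] for k in coefs)  # dot product of the equation
print(f"Predicted normalized used price: {pred:.3f}")
```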
Based on the regression model results, here are actionable insights and recommendations for ReCell:
Pricing Strategy:
New Price Influence: The normalized price of a new device significantly affects the price of a used/refurbished device. Keep a close eye on market trends in new device pricing to adjust used device pricing accordingly.
Camera Quality: Both main and selfie camera resolutions have a positive influence on the device's price. Highlight the camera quality when marketing devices with higher resolutions.
RAM and Weight Impact: Higher RAM positively affects price, while weight slightly influences it. Emphasize devices with larger RAM capacities and consider the weight factor as a secondary selling point.
Technological Advancements: The presence of 4G positively affects prices, while the 5G coefficient is negative. Leverage 4G capabilities in marketing, but treat the negative 5G coefficient with caution: it is counterintuitive and may reflect the small share of 5G devices in this 2021 dataset, so it should be investigated further before driving pricing decisions.
Lifecycle and Market Dynamics:
Device Age: Older devices face a price drop, as indicated by the negative coefficient for 'years_since_release.' Adjust pricing policies to reflect the decreasing value of older models.
Market Adaptation: Continuously evaluate market dynamics, technological advancements, and consumer preferences to adapt pricing strategies accordingly. Monitor and update the model periodically to incorporate changing trends.
Business Considerations:
Warranty and Assurance: Offer warranties and guarantees on used devices to enhance consumer trust and drive sales.
Environmental Angle: Promote the environmental benefits of buying used/refurbished devices. Highlight recycling initiatives and how purchasing second-hand devices reduces electronic waste.
Partnerships and Offers: Collaborate with third-party vendors/platforms to offer attractive deals and benefits for refurbished devices. Forge partnerships with service providers to enhance customer offers.
By implementing these insights, ReCell can formulate a robust pricing strategy, enhance marketing approaches, and cater to the evolving demands of the used/refurbished device market, maximizing their market share and profitability.